Mobile App Behavior Analysis Part 1

Exploratory Data Analysis

Q1. Why is the hour column not listed in the output of the describe function?

A1. The hour column is stored as text rather than as an integer, so describe() does not list it. By default, describe() reports basic statistics such as the count, mean, standard deviation and percentiles for the numeric columns of a data frame or series, and it excludes character (categorical) columns.
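As a minimal sketch of this behaviour, the toy frame below stores hour as text, so the default describe() skips it, and converting it to an integer brings it back. The values and the exact text format of the hour column are assumptions made for illustration, not the app data.

```python
import pandas as pd

# Toy data; values and the hour text format are made up for illustration.
df = pd.DataFrame({
    "age": [23, 31, 45],
    "hour": [" 02:00:00", " 15:00:00", " 21:00:00"],  # stored as text, not numbers
})

# By default, describe() summarises numeric columns only.
before = df.describe().columns.tolist()

# Keep the two leading hour digits and convert them to integers.
df["hour"] = df["hour"].str.strip().str.slice(0, 2).astype(int)
after = df.describe().columns.tolist()

print(before)
print(after)
```

Passing include="all" to describe() is another way to see text columns, although the summary for them is limited to counts and the most frequent value.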

Data Pre-processing

Plots

Q2. What do you observe?

A2.

  1. Most users opened their accounts on 'Sunday' and 'Thursday', while the fewest accounts were opened on 'Tuesday'.
  2. Most accounts were opened around 1:00, 15:00 and 21:00 - 23:00, while comparatively few were opened between 5:00 and 13:00.
  3. The majority of the users are aged between 15 and 40 years.
  4. A large proportion of users visited approximately 40 screens in their first 24 hours.
  5. The number of users who played games is very small compared with the number who did not.
  6. Around 42000 users did not use any premium feature during their free trial period, while around 8000 users did.
  7. Nearly 85% of the users did not click the like button even once.

Q3. Comment on the plot

A3. The above plot shows the correlation between the response variable, 'enrolled', and the other numeric variables. Three variables ('numscreens', 'minigame' and 'dayofweek') are positively correlated with 'enrolled', and 'numscreens' has the strongest positive correlation. On the other hand, 'hour', 'age', 'used_premium_feature' and 'liked' are negatively correlated with the response variable. Of these, 'age' has the strongest negative correlation (-0.15).
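A plot of this kind can be built from the correlation of each feature with the response. The sketch below uses synthetic data, so the numbers it prints are not the ones quoted above. Only the column names are taken from the text, and the way the toy response is generated is an assumption chosen to give one positive and one negative relationship.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500

# Synthetic stand-in for the app data; names follow the text, values are random.
df = pd.DataFrame({
    "numscreens": rng.integers(1, 100, n),
    "age": rng.integers(16, 70, n),
    "minigame": rng.integers(0, 2, n),
})

# Toy response: rises with numscreens and falls with age, plus noise.
score = 0.03 * df["numscreens"] - 0.03 * df["age"] + rng.normal(0, 1, n)
df["enrolled"] = (score > score.median()).astype(int)

# Pearson correlation of every feature with the response variable.
corr_with_response = df.drop(columns="enrolled").corrwith(df["enrolled"])
print(corr_with_response.sort_values())
```

Calling .plot.bar() on the resulting Series would draw the bars, with positive and negative correlations on either side of zero.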

Correlation Plot the nicer way

Q4. Comment on the heat map

From the heat map we can see that there is almost no correlation between the fields. Among all the fields, there is a slight positive correlation between used_premium_feature and minigame (correlation coefficient 0.108), whereas there is a slight negative correlation between age and numscreens (correlation coefficient -0.128).
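A heat map like this is a rendering of the pairwise correlation matrix, so the same reading can be done numerically. The sketch below, again on synthetic data with assumed relationships, keeps one copy of each pair and picks out the pair with the largest absolute correlation.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 1000

# Synthetic features; names follow the text, values are random.
age = rng.integers(16, 70, n)
df = pd.DataFrame({
    "age": age,
    "numscreens": 60 - 0.3 * age + rng.normal(0, 15, n),  # assumed weak negative link
    "minigame": rng.integers(0, 2, n),
})

corr = df.corr()

# Keep only the upper triangle (k=1 excludes the diagonal) so each pair appears once.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna()

# Pair of fields with the largest absolute correlation.
strongest = pairs.abs().idxmax()
print(strongest, round(pairs.loc[strongest], 3))
```

The same boolean mask is often inverted and handed to a plotting library to hide the redundant half of the heat map.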

Feature Engineering Process

Q5. Why are we using dropna() here?

For many users in the data we do not have an exact enrolled date, so we cannot compute the time difference between the enrolled date and the first open date for them. We therefore drop the values for which we do not have an enrolled date.
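A small sketch of this step follows. The column names first_open and enrolled_date and the dates themselves are assumptions made for illustration. A missing enrolled date produces a missing difference, which dropna() then removes.

```python
import pandas as pd

# Toy rows; column names and dates are assumed for illustration.
df = pd.DataFrame({
    "first_open": pd.to_datetime(
        ["2013-01-01 10:00", "2013-01-02 08:00", "2013-01-03 21:00"]
    ),
    "enrolled_date": pd.to_datetime(
        ["2013-01-01 13:00", None, "2013-01-06 21:00"]
    ),
})

# Hours between first open and enrolment; NaN where enrolled_date is missing.
df["difference"] = (
    df["enrolled_date"] - df["first_open"]
).dt.total_seconds() / 3600

# Only users with a known enrolment time can be placed on the histogram.
known = df["difference"].dropna()
print(known.tolist())
```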

Q6. Why are we using the range in histogram?

We have used the range argument to restrict the data shown. The histogram is plotted only for those observations in which the 'difference' feature has a value between 0 and 48.
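The effect of the range can be seen without drawing anything. The sketch below uses numpy.histogram on made-up delays. matplotlib's hist accepts a range argument with the same meaning. The bin count of 4 is an arbitrary choice for the example.

```python
import numpy as np

# Made-up enrolment delays, including a few very late enrolments.
difference = np.array([0.5, 1, 2, 3, 4, 10, 30, 47, 60, 200, 900])

# range=(0, 48) bins only the values inside that interval; the rest are ignored.
counts, edges = np.histogram(difference, bins=4, range=(0, 48))
print(counts)
print(edges)
```

Without the range, the few very large values would stretch the axis and squeeze almost all users into the first bin.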

Q7. Comment on the new time distribution

The plot clearly shows that most users enroll within 5 hours of opening the app.

Q8. What is the purpose of the above code?

We have assigned 0 to the 'enrolled' feature only for those observations where the 'difference' column has a value greater than 48.
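One way to write such an assignment is a boolean mask with .loc, sketched here on toy values. The column names come from the text and the data is made up.

```python
import pandas as pd

# Toy values; the third row has no recorded difference.
df = pd.DataFrame({
    "difference": [3.0, 50.0, None, 48.0],
    "enrolled": [1, 1, 0, 1],
})

# Enrolments later than 48 after first open are treated as not enrolled.
df.loc[df["difference"] > 48, "enrolled"] = 0
print(df["enrolled"].tolist())
```

Rows with a missing difference are left untouched, because a comparison with NaN evaluates to False.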

Extra Feature Engineering Screens

Separate screens into the separate lists

Create Funnels

Mobile App Behavior Analysis Part 2

Removing Identifiers

Feature Scaling

Model Building

Q1. Does the model hold true?

Yes, the model holds true. It gives 76% accuracy, which is quite good. We checked this accuracy using a confusion matrix and cross_val_score.
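The two checks can be sketched as follows. The data is synthetic, so the accuracy printed here is unrelated to the 76% above. LogisticRegression and the 10 folds are assumptions, since the text names neither the classifier nor the fold count.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import cross_val_score, train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for the scaled feature matrix and the enrolled label.
X = rng.normal(size=(1000, 5))
y = (X[:, 0] - X[:, 1] + rng.normal(scale=1.0, size=1000) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Assumed classifier; the text does not say which model was used.
clf = LogisticRegression()
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

cm = confusion_matrix(y_test, y_pred)  # rows: actual class, columns: predicted class
acc = accuracy_score(y_test, y_pred)   # same as cm.trace() / cm.sum()

# Accuracy on 10 cross-validation folds of the training set.
scores = cross_val_score(clf, X_train, y_train, cv=10)

print(cm)
print(f"hold-out accuracy: {acc:.3f}")
print(f"cross-validated accuracy: {scores.mean():.3f} (+/- {scores.std():.3f})")
```

A cross-validated mean close to the hold-out accuracy, with a small spread across folds, suggests the single test-set figure is not a lucky split.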

Model Conclusion

Q2. Provide recommendation to the marketing team based on the results. What has this model given us?

A2.

This model has provided us with a very important column called predicted_results. This column will help the company understand which customers are likely to stay as paid customers and which are likely to leave, which in turn will help it vary its marketing strategies.
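A minimal sketch of how such a column might be attached to the users and then used is given below. The user column, the identifiers and the predictions are all hypothetical. Only the name predicted_results comes from the text.

```python
import pandas as pd

# Hypothetical identifiers, labels and model outputs for four test-set users.
user_ids = pd.Series([101, 102, 103, 104], name="user")
y_test = pd.Series([1, 0, 1, 0], name="enrolled")
y_pred = [1, 0, 0, 0]

final_results = pd.concat([user_ids, y_test], axis=1)
final_results["predicted_results"] = y_pred

# Users the model does not expect to enrol are candidates for targeted offers.
targets = final_results.loc[
    final_results["predicted_results"] == 0, "user"
].tolist()
print(targets)
```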

We can see that the app's current financial situation is very bleak and needs to improve for the app merely to stay in the game. To achieve that, several effective promotion and marketing strategies need to be employed, such as segmenting users by age group, and the model's accuracy should be improved so that the status of customers can be predicted over the long term.